Add cdata codegen, with eager output support #30

Merged
merged 15 commits into from
Oct 13, 2024

silentbicycle

This is an alternative to -lc and -lvmc that avoids very expensive compilation when the resulting C output is large. In this mode, most of the output is C data literals (a couple of struct tables), followed by a very small (~50 LOC) interpreter for the data. This is much faster to compile: for a data set I'm working with now, it takes 30 seconds to build, compared to several hours and/or gcc exhausting memory.
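
For readers unfamiliar with the idea, here is a minimal sketch of "data tables plus a tiny interpreter". It is not the format this PR generates: the struct layout, the `sketch_` names, and the sparse edge list are all assumptions made for illustration, and the tables are hand-written to accept exactly the literal string "abc".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout (not the real cdata format): each state owns a
 * contiguous slice of a shared edge table. */
struct sketch_edge {
	unsigned char label;   /* input byte that follows this edge */
	uint8_t dst;           /* destination state */
};

struct sketch_state {
	bool end;              /* true for an accepting state */
	uint8_t edge_offset;   /* first edge in sketch_edges[] */
	uint8_t edge_count;    /* number of edges for this state */
};

/* The data: compilers handle large literal tables cheaply. */
static const struct sketch_edge sketch_edges[] = {
	{ 'a', 1 }, { 'b', 2 }, { 'c', 3 },
};

static const struct sketch_state sketch_states[] = {
	{ false, 0, 1 },
	{ false, 1, 1 },
	{ false, 2, 1 },
	{ true,  3, 0 },
};

/* The interpreter: walk the tables, one input byte per step. */
bool
sketch_match(const char *input, size_t len)
{
	size_t state = 0;
	assert(input != NULL);

	for (size_t i = 0; i < len; i++) {
		const struct sketch_state *s = &sketch_states[state];
		bool found = false;
		for (size_t e = 0; e < s->edge_count; e++) {
			const struct sketch_edge *edge = &sketch_edges[s->edge_offset + e];
			if (edge->label == (unsigned char)input[i]) {
				state = edge->dst;
				found = true;
				break;
			}
		}
		if (!found) {
			return false;  /* no edge for this byte */
		}
	}
	return sketch_states[state].end;
}
```

With these tables, `sketch_match("abc", 3)` accepts, while `"ab"`, `"abd"`, and `"abcd"` are rejected. The interpreter stays the same size no matter how large the DFA gets, which is why only the tables grow with the input.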

Generating output with comments enabled will include inline comments about the format, along with per-state comments showing labels, endids, and eager outputs. It will only generate code for endids and eager outputs if the DFA has them.

This is experimental. I expect the interfaces will change a bit in the near future, and I am still working on performance tuning.

There is some code to detect and reuse repeated runs of IDs in the output tables, but a bug causes the reused runs not to be terminated properly (possibly causing false positives), so it is currently disabled.
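
As a rough illustration of the general technique (not the code in this PR, and not a reproduction of the bug), the sketch below interns runs of IDs into one shared table. It assumes each run ends with a sentinel, here the hypothetical `RUN_END`, and that no real ID equals the sentinel. An existing occurrence is only reused if it is immediately followed by the sentinel; otherwise a reader following the offset would pick up extra IDs from a longer run.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RUN_END   UINT32_MAX  /* hypothetical sentinel ending each run */
#define TABLE_MAX 64          /* fixed capacity, to keep the sketch small */

struct id_table {
	uint32_t ids[TABLE_MAX];
	size_t used;
};

/* Return the offset of a sentinel-terminated copy of run[0..len),
 * appending one if none exists. Returns (size_t)-1 if the table is full. */
size_t
id_table_intern(struct id_table *t, const uint32_t *run, size_t len)
{
	assert(t != NULL);
	assert(run != NULL || len == 0);

	/* Reuse only if the match is followed by the sentinel. */
	for (size_t off = 0; off + len < t->used; off++) {
		size_t i = 0;
		while (i < len && t->ids[off + i] == run[i]) {
			i++;
		}
		if (i == len && t->ids[off + len] == RUN_END) {
			return off;
		}
	}

	if (t->used + len + 1 > TABLE_MAX) {
		return (size_t)-1;
	}

	const size_t off = t->used;
	for (size_t i = 0; i < len; i++) {
		t->ids[t->used++] = run[i];
	}
	t->ids[t->used++] = RUN_END;
	return off;
}
```

After interning `{1, 2, 3}` at offset 0, the run `{2, 3}` can share its tail at offset 1, but `{1, 2}` cannot reuse offset 0 because the next entry is `3` rather than the sentinel, so it is appended separately.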

To see a good example of the format, with comments, run:
    build/bin/re -rpcre -lcdata -u '^abc'


Draft: This is currently targeting the sv/eager-outputs branch, because it depends on changes added there. Once that has been reviewed and merged I will incorporate any changes from its review and retarget this to main.

fuzz/target.c Outdated
@@ -446,6 +446,8 @@ fuzz_eager_output(const uint8_t *data, size_t size)

size_t max_pattern_length = 0;

const unsigned seed = size == 0 ? 0 : data[0];
so I guess we'll srand() here

Author

Yes, I'll add that before I switch from a draft PR.

Author

This is now done in e133a74 on #29.
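
For context, the pattern discussed in this thread (derive a seed from the first input byte, then seed `rand()` with it) could look roughly like the sketch below. The function names are hypothetical and this is not the code from e133a74; only the seed expression is taken from the diff above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Same expression as in the diff: first byte, or 0 for empty input. */
unsigned
seed_from_input(const uint8_t *data, size_t size)
{
	assert(data != NULL || size == 0);
	return size == 0 ? 0 : data[0];
}

/* Seed rand() from the fuzzer input, so that any later rand() calls
 * are reproducible for a given input. */
void
seed_rand_from_input(const uint8_t *data, size_t size)
{
	srand(seed_from_input(data, size));
}
```

Because `srand()` with the same seed yields the same `rand()` sequence, replaying a saved fuzzer input reproduces the same pseudo-random choices.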

@@ -124,6 +124,7 @@ lang_name(const char *name, enum fsm_print_lang *fsm_lang, enum ast_print_lang *
{ "rust", FSM_PRINT_RUST },
{ "sh", FSM_PRINT_SH },
{ "vmc", FSM_PRINT_VMC },
{ "cdata", FSM_PRINT_CDATA },
and rx, and retest too please!

Author

rx: Added in 3ea0898.

retest: Added in 7a935b4, and that found a couple bugs because it exercised cdata output more widely.

src/libfsm/print/cdata.c Outdated (resolved)
fsm_generate_matches is no longer seeding `rand()` directly.
I confirmed that the callers actually check the return.

These should probably use the alloc interface (with `f_realloc`),
but the existing print callback typedefs don't seem to pass along
the alloc handle anymore, since it was removed from fsm_options,
so if that changes it will be in a different commit.
Base automatically changed from sv/eager-outputs to main October 12, 2024 19:00
@silentbicycle silentbicycle marked this pull request as ready for review October 12, 2024 19:05
While technically equivalent, this looks confusing. I'm not sure if it
was a typo (thinking of `eager_output_buf.used > 0`) or some kind of
search/replace artifact.
@silentbicycle silentbicycle merged commit 7cb37be into main Oct 13, 2024
346 checks passed
@silentbicycle silentbicycle deleted the sv/add-cdata-codegen-with-eager-outputs branch October 13, 2024 15:55